Loading Multiple BEL Scripts

Author: Charles Tapley Hoyt

Parsing a BEL Script with PyBEL is as simple as:

>>> import pybel
>>> graph = pybel.from_url('...')

Scenario

However, the simple functions exposed at the package level hide how caching works. When multiple BEL Scripts are loaded, the following code is very slow:

import pybel
my_urls = ['... url 1 ...', '... url 2 ...', ...]
graphs = [
    pybel.from_url(url)
    for url in my_urls
]

This is because PyBEL takes care of making a connection to a local SQLite cache on every call, so it has to build the cache in memory each time the function is run.

Solution

One solution is to make a single CacheManager object and reuse it across all of the runs, so the cache doesn't need to be loaded into memory each time.

import pybel
from pybel.manager import CacheManager
# Create the manager once so every call below shares the same cache
manager = CacheManager()
my_urls = ['... url 1 ...', '... url 2 ...', ...]
graphs = [
    pybel.from_url(url, connection=manager)
    for url in my_urls
]
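When the list of URLs is long, it can help to keep going after one script fails to load rather than losing the graphs already parsed. The following is a minimal sketch, not part of PyBEL: `load_graphs` is a hypothetical helper, and it assumes only that `load` accepts the same `(url, connection=...)` call shape as `pybel.from_url` in the example above.

```python
def load_graphs(urls, load, connection):
    """Load each URL with one shared connection, collecting failures.

    `load` is assumed to be callable as load(url, connection=connection),
    mirroring the pybel.from_url call shown above. This helper is
    illustrative and is not provided by PyBEL itself.
    """
    graphs = []
    failures = {}
    for url in urls:
        try:
            # The same connection object is passed on every iteration
            graphs.append(load(url, connection=connection))
        except Exception as exc:
            # Record the problem and continue with the remaining URLs
            failures[url] = exc
    return graphs, failures
```

With the names from the example above, this would be called as `graphs, failures = load_graphs(my_urls, pybel.from_url, manager)`.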

Other common patterns, including loading a list of graphs and taking their union, have been implemented in the PyBEL Tools IO Utilities submodule. See: http://pybel-tools.readthedocs.io/en/latest/ioutils.html

Extra Credit

The cache manager uses SQLite by default because it requires zero configuration. Better performance can be achieved by switching to a server-based relational database management system like MySQL or PostgreSQL.

This can be done by passing an RFC 1738 database connection string as the connection argument when creating the CacheManager:

from pybel.manager import CacheManager
connection = 'mysql+pymysql://<username>:<password>@<host>/<dbname>?charset=utf8[&<options>]'
manager = CacheManager(connection=connection)
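Because the connection string is a URL, characters such as `@` or `/` in a username or password need to be percent-encoded before they are placed in it. The sketch below shows one way to assemble the string with the standard library. `build_connection_string` is a hypothetical helper, and the credentials, host, and database name are made-up placeholders, not values from PyBEL.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password, host, dbname,
                            driver='mysql+pymysql'):
    """Assemble a database URL, percent-encoding the credentials.

    Illustrative helper only; the template follows the MySQL example
    above, including the charset=utf8 option.
    """
    return '{}://{}:{}@{}/{}?charset=utf8'.format(
        driver,
        quote_plus(username),  # escape reserved characters in the user name
        quote_plus(password),  # escape reserved characters in the password
        host,
        dbname,
    )


# Placeholder values for illustration only
connection = build_connection_string(
    'pybel_user', 'p@ss/word', 'localhost', 'pybel_cache'
)
```

The resulting string can then be passed as the connection argument in the same way as the hand-written one above.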

A default connection string can be set by following the instructions in the documentation at http://pybel.readthedocs.io/en/latest/constants.html#configuration-loading